ABSTRACT

We consider the problem of bringing a distributed system to a consistent state after
transient failures. We address the two components of this problem by describing a
distributed algorithm to create consistent checkpoints, as well as a rollback-recovery
algorithm to recover the system to a consistent state. In contrast to previous
algorithms, they tolerate failures that occur during their executions. Furthermore,
when a process takes a checkpoint, a minimal number of additional processes
are forced to take checkpoints. Similarly, when a process restarts after a failure, a
minimal number of additional processes are forced to restart with it. Our algorithms
require each process to store at most two checkpoints in stable storage.
This storage requirement is shown to be minimal under general assumptions.

Keywords: fault-tolerance, checkpoint, rollback-recovery, distributed systems,
consistent state.

1.  Introduction

Checkpointing and rollback-recovery are well-known techniques that allow
processes to make progress in spite of failures [Rand78]. The failures under
consideration are transient problems such as hardware errors and transaction aborts,
i.e., those that are unlikely to recur when a process restarts. With this scheme, a
process takes a checkpoint from time to time by saving its state on stable storage
[Lamp79]. When a failure occurs, the process rolls back to its most recent checkpoint,
assumes the state saved in that checkpoint, and resumes execution.

We first identify consistency problems that arise in applying this technique to
a distributed system. We then propose a checkpoint algorithm and a rollback-recovery
algorithm to restart the system from a consistent state when failures
occur. Our algorithms prevent the well-known "domino effect" as well as liveness
problems associated with rollback-recovery. In contrast to previous algorithms,
they are fault-tolerant and involve a minimal number of processes. With our
approach, each process stores at most two checkpoints in stable storage. This
storage requirement is shown to be minimal under general assumptions.

The paper is organized as follows: We discuss the notion of consistency in a
distributed system in section 2 and describe our system model in section 3. In
section 4 we identify the problems to be solved. Sections 5 and 6 contain the
checkpoint and rollback-recovery algorithms respectively. The algorithms are extended
for concurrent executions in section 7. In section 8 we consider optimizations.
Sections 9 and 10 contain a discussion and our conclusion.

2.  Consistent  Global  States  in  Distributed  Systems 

The notion of a consistent global state is central to reasoning about correctness
in distributed systems. It was initially studied in [Rand75, Russ77, Pres83]
and later formalized by Chandy and Lamport [Chan85]. We summarize the ideas
in [Chan85]:

In a distributed computation, an event at a process p can be a spontaneous
change of p's state, or the sending or receipt of a message by p. Event a
directly happens before event b if and only if

(1)  there exist states $s_1$, $s_2$, and $s_3$ such that event a changes p's process state
from $s_1$ to $s_2$ and event b changes p's process state from $s_2$ to $s_3$; or

(2)  event a is the sending of a message m by a process p and event b is the
receiving of m by another process q.


The transitive closure of the directly happens before relation is the happens before
relation. If event a happens before event b, b happens after a. (We abbreviate
happens before, "before" and happens after, "after".)

The local state of a process at time 0 is its initial state; the local state of a
process at time t is the state resulting from applying the sequence of events occurring
in $(0, t]$ to its initial state. If a process has failed by time t, its local state at t is
undefined. A global state of a system at t is the set of all processes' local
states at t. The state of a channel at time t is the set of messages sent over that
channel but not yet received at t. We can depict the occurrences of events over
time with a time diagram, in which horizontal lines are time axes of processes,
points are events, and arrows represent messages from the sending process to the
receiving process. In this representation, a global state is a cut dividing the time
diagram into two halves. The channel states are the arrows (messages) that cross
the cut. Figure 1 is a time diagram for a system of four processes.

Informally, a cut (global state) in the time diagram is consistent if no arrow
starts on the right hand side and ends on the left hand side of it. This notion of
consistency fits the observation that a message cannot be received before it is sent
in any temporal frame of reference. For example, the cuts c and c' in Figure 1 are
consistent and inconsistent cuts, respectively. The channel states corresponding to
cut c consist of one message in the channel from p to q, and one in the channel
from s to r. Readers are referred to [Chan85] for a formal discussion of consistent
global states.
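The informal definition of a consistent cut can be checked mechanically. The following is a minimal sketch, not part of the original formalism: it assumes a hypothetical encoding in which every event at a process is numbered from 1, a cut records how many events of each process lie to its left, and a message is a tuple of sender, send index, receiver, and receive index.

```python
from typing import Dict, List, Tuple

# (sender, send_index, receiver, receive_index); indices count events at each
# process starting from 1. This encoding is an assumption of the sketch.
Message = Tuple[str, int, str, int]


def is_consistent_cut(cut: Dict[str, int], messages: List[Message]) -> bool:
    """cut[p] is the number of events of p that lie to the left of the cut."""
    for sender, send_idx, receiver, recv_idx in messages:
        received_left = recv_idx <= cut[receiver]
        sent_left = send_idx <= cut[sender]
        if received_left and not sent_left:
            return False  # an arrow starts on the right and ends on the left
    return True


def channel_state(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Messages sent left of the cut and received right of it (arrows crossing the cut)."""
    return [m for m in messages if m[1] <= cut[m[0]] and m[3] > cut[m[2]]]


# Hypothetical two-process history: each process first sends, then receives.
messages = [("p", 1, "q", 2), ("q", 1, "p", 2)]
```

With this history, the cut `{"p": 1, "q": 1}` is consistent and both messages are in the channel states, while the cut `{"p": 0, "q": 2}` is inconsistent because q has received a message that p has not yet sent.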

FIG.  1.  Consistent and inconsistent cuts in a distributed system.

3.  System Model

The distributed system considered in this paper has the following characteristics:

(1)  Processes do not share memory or clocks. They communicate via messages
sent through reliable first-in-first-out (FIFO) channels with variable non-zero
transmission time.

(2)  Processes  fail  by  stopping,  and  whenever  a  process  fails,  all  other  processes  are 
informed  of  the  failure  in  finite  time.  We  assume  that  processes’  failures 
never  partition  the  communication  network. 

We want to develop our algorithms under the weakest possible set of assumptions.
In particular, we do not assume that the underlying system is a database
transaction system ([Fisc82] and [Jose85]). This special case admits simpler
solutions: the mechanisms that ensure atomicity of transactions can hide
inconsistencies introduced by the failure of a transaction. Furthermore, we do not assume
that processes are deterministic: this simplifying assumption is made in previous
results (e.g., [Stro85] and [Jose85]).

4.  Identification  of  Problems 

A checkpoint is a saved state of a process. A set of checkpoints, one per process
in the system, is consistent if the saved states form a consistent global state. For
example, consider the system history in Figure 2. Process p takes a checkpoint at
time X and sends a message to q some time later. After receiving this message, q
takes a checkpoint at time Y. Subsequently, p fails and restarts from the checkpoint
taken at X. The global state at p's restart is inconsistent because p's local
state shows that no message has been sent to q, while q's local state shows that a
message from p has been received. If p and q are processes supervising a
customer's accounts at different banks, and the message transfers funds from p to
q, the customer will have the funds at both banks when p restarts. This
inconsistency persists even if q is forced to roll back and restart from its checkpoint
taken at Y.
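The bank example can be replayed in a few lines. This is an illustrative sketch only: the class `BankProcess`, the starting balances, and the transfer amount are hypothetical, and sending and receiving are collapsed into two assignments.

```python
class BankProcess:
    """A process whose whole state is one balance, with one saved checkpoint."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.saved = balance

    def take_checkpoint(self) -> None:
        self.saved = self.balance

    def roll_back(self) -> None:
        self.balance = self.saved


def replay_figure_2(amount: int = 30):
    p, q = BankProcess(100), BankProcess(0)
    total_initial = p.balance + q.balance
    p.take_checkpoint()      # p checkpoints at time X
    p.balance -= amount      # p sends the transfer message
    q.balance += amount      # q receives it
    q.take_checkpoint()      # q checkpoints at time Y
    p.roll_back()            # p fails and restarts from X
    total_after_p = p.balance + q.balance
    q.roll_back()            # forcing q back to Y does not help
    total_after_both = p.balance + q.balance
    return total_initial, total_after_p, total_after_both
```

Under these assumed numbers the customer's total grows by the transferred amount once p restarts, and rolling q back to Y leaves the surplus in place, which is the inconsistency described above.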


FIG.  2.  Inconsistent checkpoints.

The problem of ensuring that the system recovers to a consistent state after
transient failures has two components: checkpoint creation and rollback-recovery;
we examine each one in turn.

4.1.  Checkpoint  Creation 

There are two approaches to creating checkpoints. With the first approach,
processes take checkpoints independently and save all checkpoints on stable
storage. Upon a failure, processes must find and agree upon a consistent set of
checkpoints among the saved ones. The system is then rolled back to and restarted
from this set of checkpoints [Ande79, Russ80, Wood81, Hadz82].

With the second approach, processes coordinate their checkpointing actions
such that each process saves only its most recent checkpoint, and the set of
checkpoints in the system is guaranteed to be consistent. When a failure occurs, the
system restarts from these checkpoints [Tami84].

A disadvantage of the first approach has long been recognized [Rand75,
Pres83] and is named the "domino effect". We illustrate this effect in Figure 3. In
this example, processes p and q have independently taken a sequence of
checkpoints. The interleaving of messages and checkpoints leaves no consistent set of
checkpoints for p and q, except the initial one at $\{X_0, Y_0\}$. Consequently, after p
fails, both p and q must roll back to the starting point of the computation. For
time-critical applications that require a guaranteed rate of progress, such as
real-time process control, this behavior results in unacceptable delays. An additional
disadvantage of independent checkpoints is the large amount of stable storage
required for the saved states.
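The domino effect can be reproduced with a small search for the most recent consistent set of independently taken checkpoints. The sketch below is an illustration under stated assumptions, not an algorithm from this paper: it uses a hypothetical global timeline (which real processes do not share) purely to describe a history, it assumes every process has an initial checkpoint at time 0, and the timings are invented to mimic the interleaving of Figure 3.

```python
from typing import Dict, List, Tuple

# (sender, send_time, receiver, receive_time) on a hypothetical global timeline.
Message = Tuple[str, int, str, int]


def recovery_line(checkpoints: Dict[str, List[int]],
                  messages: List[Message]) -> Dict[str, int]:
    """Return the latest consistent set of checkpoints, one per process.

    Start from the most recent checkpoints and push a receiver back whenever
    its checkpoint records a receipt whose send the sender's checkpoint lacks.
    """
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            if t_send > line[sender] and t_recv < line[receiver]:
                # The receipt must be "forgotten": fall back to the latest
                # checkpoint of the receiver taken before the receipt.
                line[receiver] = max(t for t in checkpoints[receiver] if t < t_recv)
                changed = True
    return line


# Invented timings: between any two consecutive checkpoints of one process,
# a message crosses a checkpoint of the other process.
figure_3_checkpoints = {"p": [0, 100, 200, 300], "q": [0, 120, 220, 320]}
figure_3_messages = [
    ("p", 305, "q", 310), ("q", 230, "p", 240), ("p", 205, "q", 210),
    ("q", 130, "p", 140), ("p", 105, "q", 110), ("q", 10, "p", 20),
]
```

For this history each rollback exposes another orphaned receipt, so the search only stops at the initial checkpoints of both processes.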

To  avoid  these  drawbacks,  we  pursue  the  second  approach.  In  contrast  to 
[Tami84],  our  method  ensures  that  when  a  process  takes  a  checkpoint,  a  minimal 
number  of  additional  processes  are  forced  to  take  checkpoints. 


[Figure 3: time diagram in which p takes checkpoints X0, X1, X2, X3 and then fails, and q takes checkpoints Y0, Y1, Y2, Y3.]

FIG.  3.  "Domino effect" following a failure.

4.2.  Rollback-Recovery

Rollback-recovery from a consistent set of checkpoints appears deceptively
simple. The following scheme seems to work: Whenever a process rolls back to its
checkpoint, it notifies all other processes to also roll back to their respective
checkpoints. It then installs its checkpointed state and resumes execution.
Unfortunately, this simple recovery method has a major flaw. In the absence of
synchronization, processes cannot all recover (from their respective checkpoints)
simultaneously. Recovering processes at different times introduces a liveness problem as
illustrated below.

Consider two processes p and q. Figure 4 illustrates their histories up to the
time p fails. Process p fails before receiving the message $n_1$, rolls back to its
checkpoint, and notifies q. Then p recovers, sends $m_2$, and receives $n_1$. After p's
recovery, p has no record of sending $m_1$, while q has a record of its receipt.
Therefore, the global state is inconsistent. To restore consistency, q must also roll back
(to "forget" the receipt of $m_1$) and notify p. Note that after q rolls back, q has no
record of sending $n_1$ while p has a record of its receipt. Hence, the global state is
inconsistent again, and upon notification of q's rollback, p must roll back a second
time. After q recovers, q sends $n_2$ and receives $m_2$. Suppose p rolls back before
receiving $n_2$, as shown in Figure 5. With the second rollback of p, the sending of
$m_2$ is "forgotten". To restore consistency, q must roll back a second time. After p
recovers, it receives $n_2$, and upon notification of q's rollback, it must roll back a
third time. It is now clear that p and q can be forced to roll back forever, even
though no additional failures occur.

Our rollback-recovery algorithm solves this liveness problem. It tolerates
failures that occur during its execution, and forces a minimal number of processes
to roll back after a failure. In [Tami84], a single failure forces the system to roll
back as a whole. Furthermore, the system crashes (and does not recover) if a
failure occurs while it is rolling back.

FIG.  4.  Histories of p and q up to p's failure.

5.  Checkpoint  Creation 

5.1.  Naive  Algorithms 

From Figure 2 it is obvious that if every process takes a checkpoint after
every sending of a message, and these two actions are done atomically, the set of
the most recent checkpoints is always consistent. But creating a checkpoint after
every send is expensive. We may naively reduce the cost of the above method with
a strategy such as "every process takes a checkpoint after every k sends, k > 1" or
"every process takes a checkpoint on the hour". However, the former can be shown
to suffer domino effects by a construction similar to the one in Figure 3, while the
latter is meaningless for a system that lacks perfectly synchronized clocks.

5.2.  Classes  of  Checkpoints 

Our algorithm saves two kinds of checkpoints on stable storage: permanent
and tentative. A permanent checkpoint cannot be undone. It guarantees that the
computation needed to reach the checkpointed state will not be repeated. A
tentative checkpoint, however, can be undone or changed to be a permanent checkpoint.
When the context is clear, we call permanent checkpoints "checkpoints".

Consider a system with a consistent set of permanent checkpoints. A
checkpoint algorithm is resilient to failures if the set of permanent checkpoints in the
system is still consistent after the algorithm terminates, even if some processes fail
during its execution. Consider systems where processes cannot afford to take a
checkpoint after every send, or systems where processes cannot combine the
sending of a message and the taking of a checkpoint atomically. For these systems,
checkpoint algorithms must store at least two checkpoints in stable storage in
order to be resilient to failures.

Theorem 1:  No resilient checkpoint algorithms exist that take only permanent
checkpoints.

Proof:  By contradiction. Suppose such an algorithm A exists. Consider the
following scenario: p and q are processes in the system. Suppose that
by time t, t > 0, p has received a message $m_q$ from q, and q a message
$m_p$ from p. At time t, process p decides to use A to take a checkpoint.
Let A finish by time t', and suppose process p takes a permanent
checkpoint $C_{p,t_p}$ at time $t_p$, such that $t < t_p < t'$. Since the set of
checkpoints at the termination of A must be consistent, process q must also
have taken a permanent checkpoint $C_{q,t_q}$ at time $t_q$, such that
$t < t_q < t'$. Let d be the minimum time required for the failure of a
process to be detected. Depending on whether $t_p \le t_q$ or $t_p > t_q$, we
construct another run of A in which one process fails, to show that A is
not resilient.

Case 1: $t_p \le t_q$. Let q fail in the time interval $(\max(t, t_q - d), t_q)$.
Process p discovers the failure after $t_q$, hence after $t_p$. (See Figure 6.)
Consequently, $C_{p,t_p}$ is taken although $C_{q,t_q}$ is not. Since $C_{p,t_p}$ is a
permanent checkpoint that cannot be undone, and q fails before making a
permanent checkpoint, the sending of $m_q$ is "forgotten" forever while
the receipt of $m_q$ is always "remembered", no matter what A does
after p detects the failure. Hence, contrary to our assumption,
Algorithm A is not resilient.

Case 2: $t_p > t_q$. Let p fail in the time interval $(\max(t, t_p - d), t_p)$. The
rest of the proof is analogous to Case 1. □

Theorem 1 shows that besides the impractical "naive" algorithm described in
section 5.1, any resilient checkpoint algorithm must store at least two checkpoints on
stable storage.

5.3.  Our Checkpoint Algorithm

We assume that the algorithm is invoked by a single process that wants to
take a permanent checkpoint. We also assume that no failures occur in the system.
In section 5.3.4 we extend the algorithm to handle failures, and in section 7 we
describe concurrent invocations of this algorithm.

5.3.1.  Motivation 

To create consistent checkpoints, processes can execute an algorithm that is
patterned on two-phase-commit protocols. In the first phase, the initiator q takes a
tentative checkpoint and requests all processes to take tentative checkpoints. If q
learns that all processes have taken tentative checkpoints, q decides all tentative
checkpoints should be made permanent; otherwise, q decides tentative checkpoints
should be discarded. In the second phase, q's decision is propagated and carried out
by all processes. Since all or none of the processes take permanent checkpoints, the
most recent set of checkpoints is always consistent.

However, our goal is to force a minimal number of processes to take
checkpoints. The above algorithm is modified as follows: A process p takes a tentative
checkpoint after it receives a checkpoint request from q only if q's tentative
checkpoint records the receipt of a message from p, while p's latest permanent
checkpoint does not record the sending of that message. Process p determines whether
this condition is true using the label appended to q's request. This labeling scheme
is described below.

Messages that are not sent by the checkpoint or rollback-recovery algorithms
are system messages. Every system message m contains a label m.l. Each process
attaches monotonically increasing labels to its outgoing system messages. We
define $\bot$ and $\top$ to be the smallest and largest labels, respectively. For any
processes r and p, let m be the last message that r received from p after r took its
last permanent or tentative checkpoint. Define:

\[
\mathit{last\_rmsg}_r(p) =
\begin{cases}
m.l & \text{if } m \text{ exists} \\
\bot & \text{otherwise.}
\end{cases}
\]

Also, let m be the first message that r sent to process p after r took its last
permanent or tentative checkpoint. Define:

\[
\mathit{first\_smsg}_r(p) =
\begin{cases}
m.l & \text{if } m \text{ exists} \\
\bot & \text{otherwise.}
\end{cases}
\]

When q requests p to take a tentative checkpoint, it appends $\mathit{last\_rmsg}_q(p)$ to its
request; p takes the checkpoint if $\bot < \mathit{first\_smsg}_p(q) \le \mathit{last\_rmsg}_q(p)$.
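The labeling scheme can be sketched as a small bookkeeping class. This is a hedged illustration: the class `LabelState`, its method names, and the use of floating-point infinities for the smallest and largest labels are assumptions of the sketch, and message transport is left out entirely.

```python
import math

# Stand-ins for the smallest and largest labels (an assumption of this sketch).
BOTTOM, TOP = -math.inf, math.inf


class LabelState:
    """Per-process label bookkeeping between two checkpoints."""

    def __init__(self) -> None:
        self._next = 0
        self._last_rmsg = {}   # sender -> label of last message received
        self._first_smsg = {}  # receiver -> label of first message sent

    def send_to(self, dest: str) -> int:
        """Return the label for a new system message to dest."""
        self._next += 1                       # labels increase monotonically
        self._first_smsg.setdefault(dest, self._next)
        return self._next

    def receive_from(self, src: str, label: int) -> None:
        self._last_rmsg[src] = label

    def checkpoint_taken(self) -> None:
        """Both maps are relative to the last permanent or tentative checkpoint."""
        self._last_rmsg.clear()
        self._first_smsg.clear()

    def last_rmsg(self, src: str):
        return self._last_rmsg.get(src, BOTTOM)

    def first_smsg(self, dest: str):
        return self._first_smsg.get(dest, BOTTOM)


def must_take_checkpoint(first_smsg_p_q, last_rmsg_q_p) -> bool:
    """p's test on receiving q's request carrying last_rmsg_q(p)."""
    return BOTTOM < first_smsg_p_q <= last_rmsg_q_p
```

The test holds exactly when q has received some message that p sent after p's last checkpoint; a requester would read `last_rmsg` before calling `checkpoint_taken`, since the request carries the value from before the tentative checkpoint.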

5.3.2.  Description 

Process p is a ckpt_cohort of q if q has taken a tentative checkpoint, and
$\mathit{last\_rmsg}_q(p) > \bot$ before the tentative checkpoint was taken. The set of
ckpt_cohorts of q is denoted $\mathit{ckpt\_cohort}_q$. Every process p keeps a variable
$\mathit{willing\_to\_ckpt}_p$ to denote its willingness to take checkpoints. Whenever p cannot
be interrupted to run the checkpoint algorithm, $\mathit{willing\_to\_ckpt}_p$ is "no". The
initiator q starts the checkpoint algorithm by making a tentative checkpoint and
sending a request "take a tentative checkpoint and $\mathit{last\_rmsg}_q(p)$" to all
$p \in \mathit{ckpt\_cohort}_q$. A process p inherits this request if $\mathit{willing\_to\_ckpt}_p$ is "yes" and
$\mathit{last\_rmsg}_q(p) \ge \mathit{first\_smsg}_p(q) > \bot$. After p inherits a request, it takes a tentative
checkpoint and sends "take a tentative checkpoint and $\mathit{last\_rmsg}_p(r)$" requests to
all $r \in \mathit{ckpt\_cohort}_p$. If p receives but does not inherit a request from q, p replies
$\mathit{willing\_to\_ckpt}_p$ to q.

After p sends out its requests, it waits for replies that can be either "yes" or
"no", indicating a ckpt_cohort's acceptance or rejection of p's request. If at least one
reply is "no", $\mathit{willing\_to\_ckpt}_p$ becomes "no"; otherwise $\mathit{willing\_to\_ckpt}_p$ is
unchanged. Process p then sends $\mathit{willing\_to\_ckpt}_p$ to the process whose request p
has inherited.

If all the replies from its ckpt_cohorts arrive and are all "yes", the initiator
decides to make all tentative checkpoints permanent. Otherwise the decision is to
undo all tentative checkpoints. This decision is propagated in the same fashion as
the request "take a tentative checkpoint" was delivered. Between the time a
process p takes a tentative checkpoint and the time it receives the decision from the
initiator, p does not send any system messages. Also, after processes take new permanent
checkpoints, they may discard their previous checkpoints.

The algorithm is presented in Figure 7. For simplicity, we create a fictitious
process called daemon to assume the initiation and decision tasks of the initiator.
In practice, daemon is a part of the initiator process.


Daemon process:

    send(initiator, "take a tentative checkpoint and ⊤");
    await(initiator, reply);¹
    if reply = "yes" then
        send(initiator, "take tentative checkpoint permanent")
    else
        send(initiator, "undo tentative checkpoint")
    fi.

All processes p:

    INITIAL STATE:
        first_smsg_p(daemon) = ⊤;
        willing_to_ckpt_p = "yes" if p is willing to take a checkpoint,
                            "no" otherwise;

    UPON RECEIPT OF "take a tentative checkpoint and last_rmsg_q(p)" from q DO
        if willing_to_ckpt_p and last_rmsg_q(p) ≥ first_smsg_p(q) > ⊥ then
            take a tentative checkpoint;
            for all r ∈ ckpt_cohort_p, send(r, "take a tentative checkpoint and last_rmsg_p(r)");
            for all r ∈ ckpt_cohort_p, await(r, willing_to_ckpt_r);
            if ∃ r ∈ ckpt_cohort_p, willing_to_ckpt_r = "no" then willing_to_ckpt_p ← "no" fi;
        fi;
        send(q, willing_to_ckpt_p);
    od.

    UPON FIRST RECEIPT OF m = "take tentative checkpoint permanent" or
                          m = "undo tentative checkpoint" DO
        if m = "take tentative checkpoint permanent" then
            take tentative checkpoint permanent;
        else
            undo tentative checkpoint;
        fi;
        for all r ∈ ckpt_cohort_p, send(r, m);
    od.

¹ await does not prevent a process from receiving messages.

FIG.  7.  Algorithm C1: the Checkpoint Algorithm
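The control flow of the checkpoint algorithm can be exercised in a small failure-free simulation. The sketch below is an illustration under strong simplifying assumptions, not the authors' implementation: message passing is replaced by direct method calls (so requests, replies, and decisions are delivered immediately and in order), a checkpoint is represented only by a counter, and the names `Process`, `on_request`, `on_decision`, and `run_checkpoint` are hypothetical.

```python
import math

BOTTOM, TOP = -math.inf, math.inf  # stand-ins for the smallest and largest labels


class Process:
    def __init__(self, name, willing=True):
        self.name = name
        self.willing = willing   # willing_to_ckpt
        self.next_label = 0
        self.last_rmsg = {}      # sender name -> label; a missing key means BOTTOM
        self.first_smsg = {}     # receiver name -> label; a missing key means BOTTOM
        self.tentative = False
        self.saved = None        # label state restored if the checkpoint is undone
        self.cohorts = []
        self.permanent = 0       # number of new permanent checkpoints taken

    def send(self, other):
        """Send one system message; delivery is immediate in this sketch."""
        self.next_label += 1
        self.first_smsg.setdefault(other.name, self.next_label)
        other.last_rmsg[self.name] = self.next_label

    def on_request(self, label, requester, system):
        """Handle a request carrying last_rmsg_requester(self); return the reply."""
        reply = self.willing
        first = self.first_smsg.get(requester.name, BOTTOM)
        if self.willing and label >= first > BOTTOM:   # inherit the request
            self.saved = (self.last_rmsg, self.first_smsg)
            self.cohorts = [system[n] for n in self.last_rmsg]
            self.last_rmsg, self.first_smsg = {}, {}   # tentative checkpoint taken
            self.tentative = True
            for r in self.cohorts:
                if not r.on_request(self.saved[0][r.name], self, system):
                    reply = False
        return reply

    def on_decision(self, commit):
        if not self.tentative:   # only the first receipt has an effect
            return
        self.tentative = False
        if commit:
            self.permanent += 1
        else:
            self.last_rmsg, self.first_smsg = self.saved  # undo
        for r in self.cohorts:
            r.on_decision(commit)


def run_checkpoint(initiator, system):
    """Play the daemon: request with the largest label, then propagate the decision."""
    daemon = Process("daemon")
    initiator.first_smsg["daemon"] = TOP
    commit = initiator.on_request(TOP, daemon, system)
    initiator.on_decision(commit)
    initiator.first_smsg.pop("daemon", None)
    return commit
```

In a system where q has sent to p and r has sent to q, a checkpoint initiated by p forces q and r to checkpoint as well, while a process that exchanged no messages with them is left alone. If r is unwilling, every tentative checkpoint is undone and the label state is restored.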




5.3.3.  Proof  of  Correctness 

We consider a single invocation of the algorithm, and we assume no process
fails in the system.

Lemma 2:  Every process inherits a request to take a tentative checkpoint at most
once.

Proof:  Immediately after a process p inherits a request it takes a tentative
checkpoint. From the time p takes this checkpoint to the time it
receives the initiator's decision, p does not send any system messages.
Therefore, during this interval of time, $\mathit{first\_smsg}_p(q) = \bot$ for all q.
Hence process p does not inherit additional requests during the execution of
the algorithm. □

Lemma 3:  Every process terminates its execution of Algorithm C1.

Proof:  Any process that executes C1 without making a tentative checkpoint
clearly terminates. Let p be a process that takes a tentative checkpoint.
By lemma 2, p inherits a request to take a tentative checkpoint
at most once. Consequently, to prove that C1 terminates at p, it suffices
to prove that after p takes a tentative checkpoint, it does not wait forever
for either the "yes" or "no" from its ckpt_cohorts, or the initiator's
decision.

Let q be a ckpt_cohort of p. If q inherits p's request to take a tentative
checkpoint, it sends $\mathit{willing\_to\_ckpt}_q$ to p before it waits for the
initiator's decision. If q, on the other hand, does not inherit p's request,
it sends p $\mathit{willing\_to\_ckpt}_q$ immediately after receiving p's request.
Therefore, there can be no deadlock involving p waiting for replies from
its ckpt_cohorts and a ckpt_cohort of p waiting for the initiator's decision.

Suppose that p is in a deadlock waiting for replies from its
ckpt_cohorts. Then there exists a circular chain of processes
$p = p_0, \ldots, p_{k-1}$ ($k > 1$) such that for $0 \le i < k$, $p_i$ waits forever for its
ckpt_cohort $p_{(i+1) \bmod k}$ to send $\mathit{willing\_to\_ckpt}_{p_{(i+1) \bmod k}}$. If $p_i$ waits
forever for $p_{(i+1) \bmod k}$, $p_{(i+1) \bmod k}$ must have inherited a request from $p_i$.
Since the initiator does not inherit any requests, it is not in the chain.
And since there is only one initiator, there must exist a process q such
that for some i, $p_i$ inherits a request from q, and $q \ne p_j$ for all j. But $p_i$
contradicts lemma 2 by inheriting two requests: one from q and one
from $p_{(i-1) \bmod k}$. Consequently, no deadlock can exist and p will receive
replies from all its ckpt_cohorts.

Since every process receives replies from all its ckpt_cohorts, the initiator
will receive replies from all its ckpt_cohorts to decide on the tentative
checkpoints. Its decision is guaranteed to reach all processes that
have taken tentative checkpoints because all processes will pass on the
decision and messages are always delivered. Thus we have shown that
no process waits forever for replies from its ckpt_cohorts or the
initiator's decision. □

The next lemma shows that C1 takes a consistent set of checkpoints.

Lemma 4:  If the set of checkpoints in the system is consistent before the execution
of Algorithm C1, the set of checkpoints in the system is consistent after
the termination of C1.

Proof:  Without loss of generality, assume new checkpoints are taken in C1.
The proof is by contradiction. Suppose the set of checkpoints after C1
terminates is not consistent. Then there must exist two processes p and
q such that p sent q a message m after making its permanent checkpoint,
and q received m before making its permanent checkpoint. Since
all checkpoints are consistent before the execution of C1, q must have
taken its permanent checkpoint during this execution. Before q took a
tentative checkpoint in C1, $\mathit{last\_rmsg}_q(p) \ge m.l$; therefore, p was in
$\mathit{ckpt\_cohort}_q$ and received a request to take a tentative checkpoint from
q. When p received the request, $\mathit{willing\_to\_ckpt}_p$ had to be "yes"
because q cannot have made its tentative checkpoint permanent otherwise.
Moreover, if p had not taken a tentative checkpoint when q's
request arrived, $\mathit{last\_rmsg}_q(p) \ge \mathit{first\_smsg}_p(q)$ because
$\mathit{first\_smsg}_p(q) \le m.l$. Hence, process p took a tentative checkpoint after
sending m. Process p, however, must make its tentative checkpoint permanent
if q makes its own permanent. Consequently p takes a permanent
checkpoint after sending m, a contradiction. □

We now show that the number of processes that take new permanent checkpoints
during the execution of Algorithm C1 is minimal. Let $P = \{p_0, p_1, \ldots, p_k\}$
be the set of processes that take new permanent checkpoints in C1, where $p_0$ is the
initiator of C1. Let $C(P) = \{c(p_0), c(p_1), \ldots, c(p_k)\}$ be the permanent checkpoints
taken by processes in P. Define an alternate set of checkpoints:
$C'(P) = \{c'(p_0), c'(p_1), \ldots, c'(p_k)\}$ where $c'(p_0) = c(p_0)$ and for $1 \le i \le k$, $c'(p_i)$ is
either $c(p_i)$ or the checkpoint $p_i$ had before executing C1.

Theorem  5:  C'(P)  is  consistent  if  and  only  if  C'(P)=C(P). 

Proof:  Without loss of generality, assume $|P| \ge 2$. The if part is by lemma 4.
We show the only if part by contradiction. Suppose $C'(P) \ne C(P)$ and
$C'(P)$ is consistent. Then the set Q of all processes q in P such
that $c'(q) \ne c(q)$ is nonempty. For any processes p and q, if p
inherits a checkpoint request from q, q's tentative checkpoint is taken
before p's. Therefore, the inherit relation is non-circular. Because of
this non-circularity and the fact that the initiator is not in Q (since
$c'(p_0) = c(p_0)$), there exists $p_i \in Q$ such that $p_i$ inherits a checkpoint
request from another process $p_j \notin Q$. Since $p_i \in P$ implies $p_j \in P$, we know
that $c'(p_j) = c(p_j)$.

When $p_i$ inherits $p_j$'s request, $\mathit{last\_rmsg}_{p_j}(p_i) \ge \mathit{first\_smsg}_{p_i}(p_j) > \bot$.
There exists a message m such that $\mathit{last\_rmsg}_{p_j}(p_i) = m.l$. In $C'(P)$, the
sending of m is not recorded in $c'(p_i)$ since $m.l \ge \mathit{first\_smsg}_{p_i}(p_j)$, but the
receipt of m is recorded in $c'(p_j)$. Contrary to the assumption, $C'(P)$ is
not consistent. □

Theorem  5  shows  that  if  p0  takes  a  checkpoint,  then  all  processes  in  P  must  take  a 
checkpoint  to  ensure  global  consistency. 

5.3.4.  Coping  with  Failures 

We now extend Algorithm C1 to handle process failures. We first consider
the effects of failures on non-faulty processes. When failures occur, a non-faulty
process may fail to receive one or more of the following messages:

(1)  "yes”  or  "no”  from  ckpt_cohorts, 

(2)  "take  tentative  checkpoint  permanent”  or  "undo  tentative  checkpoint”  from 
the  initiator. 

Suppose process p fails before replying "yes" or "no" to process q's request. By
the assumptions of section 3, q will know of p's failure. Process q can then assume
that p is unwilling to take a permanent checkpoint. This assumption is correct
even if p has taken a tentative checkpoint before it fails, as long as p undoes its
tentative checkpoint when it recovers (see section 5.5). Therefore, to take care of
the case of a missing "yes" or "no", it suffices to change the line in C1 from

    if ∃ r ∈ ckpt_cohort_p, willing_to_ckpt_r = "no" then willing_to_ckpt_p ← "no" fi

to

    if ∃ r ∈ ckpt_cohort_p, willing_to_ckpt_r = "no" or r has failed then
        willing_to_ckpt_p ← "no" fi.
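The modified line can be read as a small decision function. The following Python sketch is only an illustration: the function name `updated_willingness` and its parameters are hypothetical, and it assumes that every cohort that has not failed has already replied.

```python
from typing import Dict, Set


def updated_willingness(own: str, cohorts: Set[str],
                        replies: Dict[str, str], failed: Set[str]) -> str:
    """Return the new value of willing_to_ckpt_p after phase one.

    own     -- p's current willingness, "yes" or "no"
    cohorts -- names of p's ckpt_cohorts (hypothetical representation)
    replies -- explicit "yes"/"no" replies, keyed by cohort name
    failed  -- cohorts that p has detected as failed
    """
    for r in cohorts:
        # A failed cohort is treated exactly like an explicit "no".
        if r in failed or replies.get(r) == "no":
            return "no"
    # No "no" and no failure: p's own willingness is unchanged.
    return own
```

For instance, with cohorts {"a", "b"}, a "yes" from a and a detected failure of b, the function returns "no", which is the behaviour the changed line adds.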

Suppose that a process p does not receive the decision regarding its tentative
checkpoint. If p undoes its tentative checkpoint or makes it permanent, it risks
contradicting the initiator. A common practice in this situation is to have p blocked
until it discovers the initiator's decision [Skee82]. We will discuss ways to obviate
blocking in section 8.

We now consider the recovery of faulty processes. When a process restarts
after a failure, its latest checkpoint on stable storage may be tentative or
permanent. If this checkpoint is tentative, the recovering process must decide whether
to discard it or to make it permanent. The decision is made as follows:

Suppose the recovering process is the initiator. The initiator knows that every
process that has taken a tentative checkpoint is still blocked waiting for its
decision. Hence it is safe for the initiator to decide to undo the tentative checkpoints
and send this decision to its ckpt_cohorts.

If  the  recovering  process  is  not  the  initiator,  it  must  discover  the  initiator’s 
decision  regarding  tentative  checkpoints.  It  may  contact  either  the  initiator  or 
those  processes  of  which  it  is  a  ckpt_cohort;  it  follows  the  decision  accordingly. 

Now the recovering process is left with one permanent checkpoint on stable
storage. Recovery is complete when it uses the rollback-recovery algorithm to be
presented in section 6 to restart from this checkpoint.
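The recovery rule for a tentative checkpoint can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Checkpoint` class, the function `resolve_on_recovery`, the decision strings "undo" and "permanent", and the `learn_decision` callback (standing in for contacting the initiator or another process that knows the decision) are all hypothetical names. The sketch also assumes that a tentative checkpoint always sits on top of an older permanent one, and it omits the step in which a recovering initiator sends its decision to its ckpt_cohorts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    state: str          # saved process state (opaque here)
    tentative: bool     # True until the initiator's decision is applied


def resolve_on_recovery(stable: List[Checkpoint], is_initiator: bool,
                        learn_decision: Callable[[], str]) -> Checkpoint:
    """Leave exactly the checkpoint to restart from at the end of `stable`."""
    latest = stable[-1]
    if latest.tentative:
        # A recovering initiator may safely decide "undo"; any other
        # process must find out what the initiator decided.
        decision = "undo" if is_initiator else learn_decision()
        if decision == "undo":
            stable.pop()               # discard the tentative checkpoint
        else:
            latest.tentative = False   # make it permanent
            del stable[:-1]            # the older checkpoint is no longer needed
    return stable[-1]
```

After the call, the returned checkpoint is permanent and is the one from which the rollback-recovery algorithm would restart the process.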

Let C2 be the Algorithm C1 as modified above. C2 terminates if all processes
that fail during the execution of C2 recover. At termination, the set of checkpoints
in the system is consistent and the number of processes that took new permanent
checkpoints is minimal. The proofs for these properties are similar to those of C1
and are omitted.

6. Rollback-Recovery

We assume that the algorithm is invoked by a single process that wants to roll
back and recover (henceforth denoted restart). We also assume that the checkpoint
algorithm and the rollback-recovery algorithm are not invoked concurrently.
Concurrent invocations of the algorithms are described in section 7.


6.1.  Motivation 

The rollback-recovery algorithm is patterned on two-phase-commit protocols.
In the first phase, the initiator q requests all processes to indicate their willingness
to restart from their checkpoints. Process q decides to restart all the processes if
and only if they are all willing to restart. In the second phase, q's decision is
propagated and carried out by all processes. We will prove that the two-phase
structure of this algorithm prevents the liveness problem discussed in section 4.2. Since
all or none of the processes restart, when the rollback-recovery algorithm
terminates the global state is consistent.

However, our goal is an algorithm that rolls back a minimal number of
processes in order to recover from a failure. If a process p rolls back to a state
saved before an event e occurred, we say that e is undone by p. With our
algorithm, process p must restart only if q's rollback will undo the sending of a
message to p. Process p determines if it must restart using the label appended to q's
request.

For any processes r and p, let m be the last message that r sent to p before r
took its latest permanent checkpoint. Define

    last_smsg_r(p) = m.l   if m exists
                     ⊤     otherwise.

When q requests p to restart, it appends last_smsg_q(p) to its request. Process p
restarts from its permanent checkpoint if last_rmsg_p(q) > last_smsg_q(p).
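The restart test is a single label comparison. The Python sketch below is an illustration only: it assumes labels are integers, uses `float("inf")` as a stand-in for ⊤, and the helper names `last_smsg` and `must_restart` are hypothetical.

```python
from typing import List, Union

Label = Union[int, float]

TOP = float("inf")  # assumed stand-in for the largest label, written ⊤ in the text


def last_smsg(labels_sent_before_ckpt: List[int]) -> Label:
    """Label of the last message r sent to p before r's latest permanent
    checkpoint, given in send order; TOP if no such message exists."""
    return labels_sent_before_ckpt[-1] if labels_sent_before_ckpt else TOP


def must_restart(last_rmsg_p_q: Label, last_smsg_q_p: Label) -> bool:
    """True if p has received a message from q with a label greater than
    the label q appended to its restart request."""
    return last_rmsg_p_q > last_smsg_q_p
```

For example, if q's checkpoint covers the messages labelled 1, 2 and 3 that it sent to p, and p has received a message labelled 5 from q, then `must_restart(5, last_smsg([1, 2, 3]))` is true, because the message labelled 5 was sent after q's checkpoint and q's rollback undoes that send.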

6.2.  Description 

Process p is a roll_cohort of q if q can send messages to it. The set of
roll_cohorts of q is roll_cohort_q.² Every process p keeps a variable willing_to_roll_p to
denote its willingness to roll back. The initiator q starts the rollback-recovery
algorithm by sending a request "prepare to roll back and last_smsg_q(p)" to all
p ∈ roll_cohort_q. A process p inherits this request if willing_to_roll_p is "yes",
last_rmsg_p(q) > last_smsg_q(p), and p has not already inherited another request to
roll back. After p inherits the request, it sends "prepare to roll back and
last_smsg_p(r)" to all r ∈ roll_cohort_p; otherwise, it replies willing_to_roll_p to q.

² The relationship between roll_cohort and ckpt_cohort is not symmetric. If p is a ckpt_cohort of
q, last_rmsg_q(p) > ⊥, and q must then be a roll_cohort of p. On the other hand, it is possible that
p ∉ ckpt_cohort_q but q ∈ roll_cohort_p, because p can but does not send messages to q.

After p sends out its requests, it waits for replies from each process in
roll_cohort_p. The reply can be an explicit "yes" or "no" message, or an implicit
"no" when p discovers that a process r in roll_cohort_p has failed. If at least one
reply is "no", willing_to_roll_p becomes "no", otherwise willing_to_roll_p is unchanged.
Process p then sends willing_to_roll_p to the process whose request p inherits. Between
the time p inherits the rollback request and the time it receives the decision from
the initiator, it does not send any system messages.

If all the replies from its roll_cohorts arrive and are all "yes", the initiator
decides the rollbacks will proceed, otherwise it decides no process will roll back.
This decision is propagated to all processes in the same fashion as the request
"prepare to roll back" is delivered. Process p blocks waiting for the discovery of the
initiator's decision, if failures prevent the decision from reaching p. We discuss
non-blocking algorithms in section 8.

The rollback-recovery algorithm is presented in Figure 8. As in the presentation
of Algorithm C1, we introduce a fictitious process called daemon to perform
functions that are unique to the initiator of the algorithm.
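To make the two phases concrete, here is a small sequential Python simulation of the request and decision propagation described above. It is a sketch under several assumptions that are not part of the paper: message passing is replaced by recursive function calls, labels are integers with `float("inf")` and `float("-inf")` standing in for the largest and smallest labels, a process that has received nothing from a sender is given the smallest label, a process acts on the decision only if it inherited a request, and its ready_to_roll flag is reset when it does so. The class `Proc` and the functions `prepare`, `decide`, `run_rollback` and `example` are hypothetical names.

```python
TOP = float("inf")      # assumed stand-in for the largest label
BOTTOM = float("-inf")  # assumed stand-in for the smallest label


class Proc:
    def __init__(self, name, cohorts=(), willing=True):
        self.name = name
        self.cohorts = list(cohorts)      # roll_cohort_p: processes p can send to
        self.willing = willing            # willing_to_roll_p
        self.ready = True                 # ready_to_roll_p
        self.last_rmsg = {"daemon": TOP}  # sender -> label of last message received
        self.last_smsg = {}               # receiver -> last label sent before checkpoint
        self.rolled_back = False


def prepare(procs, name, sender, label, failed):
    """Phase one at process `name`; returns its reply, "yes" or "no"."""
    p = procs[name]
    inherits = (p.willing and p.ready
                and p.last_rmsg.get(sender, BOTTOM) > label)
    if inherits:
        p.ready = False
        for r in p.cohorts:
            if r in failed:
                p.willing = False         # implicit "no" from a failed cohort
            elif prepare(procs, r, name, p.last_smsg.get(r, TOP), failed) == "no":
                p.willing = False
    return "yes" if p.willing else "no"


def decide(procs, name, decision, failed):
    """Phase two: carry out and forward the initiator's decision."""
    p = procs[name]
    if name in failed or p.ready:
        return                            # failed, or never inherited a request
    p.ready = True                        # assumption: reset the flag here
    p.rolled_back = decision == "roll back"
    for r in p.cohorts:
        decide(procs, r, decision, failed)


def run_rollback(procs, initiator, failed=frozenset()):
    """Play the daemon: start phase one, then announce the decision."""
    reply = prepare(procs, initiator, "daemon", BOTTOM, failed)
    decision = "roll back" if reply == "yes" else "do not roll back"
    decide(procs, initiator, decision, failed)
    return decision, {n for n, p in procs.items() if p.rolled_back}


def example():
    """q sent p a message (label 2) after its checkpoint; p sent s nothing new."""
    procs = {
        "q": Proc("q", cohorts=["p"]),
        "p": Proc("p", cohorts=["q", "s"]),
        "s": Proc("s"),
    }
    procs["q"].last_smsg["p"] = 1
    procs["p"].last_rmsg["q"] = 2
    procs["p"].last_smsg["s"] = 1
    procs["s"].last_rmsg["p"] = 1
    return procs
```

Calling `run_rollback(example(), "q")` makes q and p roll back while s, which received nothing from p after p's checkpoint, replies "yes" without rolling back. Passing `failed={"s"}` turns s into an implicit "no", so the decision becomes "do not roll back".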

6.3.  Proof  of  Correctness 

We first assume that the rollback-recovery algorithm is invoked by a single
process that wants to restart. The variable ready_to_roll_p ensures that a process p
inherits at most one request to roll back. Therefore, to prove the termination of
Algorithm R, it suffices to show that Algorithm R is free of deadlock and that it
rolls each process back at most once.

Lemma 6: Algorithm R is deadlock free.

Proof: Similar to the proof of lemma 3. □

Lemma 7: Every process in the system rolls back at most once.

Proof: Without loss of generality, assume that the initiator decides to roll
back. The initiator receives replies from all its roll_cohorts only after
all processes have received replies from all their respective roll_cohorts.
Therefore, should a process p receive a rollback request from another
process q after p has received the initiator's decision, the initiator must
have decided to roll back before it received all the replies from its
roll_cohorts, an impossibility. □

We next show that for each send event that is undone in Algorithm R, its
corresponding receive event is also undone.

Daemon process:

    send(initiator, "prepare to roll back and ⊥");
    await(initiator, reply);
    if reply = "yes" then
        send(initiator, "roll back")
    else
        send(initiator, "do not roll back")
    fi.


All processes p:

  Initial state:

    ready_to_roll_p = true;
    last_rmsg_p(daemon) = ⊤;
    willing_to_roll_p = "yes"  if p is willing to roll back
                        "no"   otherwise

  UPON RECEIPT OF "prepare to roll back and last_smsg_q(p)" from q DO
    if willing_to_roll_p = "yes" and last_rmsg_p(q) > last_smsg_q(p) and ready_to_roll_p then
        ready_to_roll_p ← false;
        for all r ∈ roll_cohort_p, send(r, "prepare to roll back and last_smsg_p(r)");
        for all r ∈ roll_cohort_p, await(r, willing_to_roll_r);
        if ∃ r ∈ roll_cohort_p, willing_to_roll_r = "no" or r has failed
            then willing_to_roll_p ← "no" fi;
    fi;
    send(q, willing_to_roll_p);
  OD;

  UPON RECEIPT OF m = "roll back" or m = "do not roll back" and ready_to_roll_p = false DO
    if m = "roll back" then
        roll back to p's permanent checkpoint;
    else
        resume normal execution;
    fi;
    for all r ∈ roll_cohort_p, send(r, m);
  OD.


FIG. 8. Algorithm R: the Rollback Algorithm


Lemma 8: After every process has terminated its execution of Algorithm R, for
each send event that was undone, its corresponding receive event was
also undone.

Proof: Without loss of generality, assume that the initiator decides to roll
back. The proof is by contradiction. Suppose that after Algorithm R
terminates, there exists a message m such that the receiver p did not
undo the receipt of m while the sender q undid the sending of m. First,
we show that p inherited a request to roll back. Since q cannot send
system messages after inheriting a rollback request, q must have sent
m before inheriting the request. And since q undid the sending of m,
m.l > last_smsg_q(p). Therefore, when p receives q's request,
last_rmsg_p(q) ≥ m.l > last_smsg_q(p). In addition, the variable
willing_to_roll_p must have been "yes"; otherwise the initiator cannot
have decided to roll back. Consequently, when q's request reached p,
either p had already inherited a rollback request or it inherited q's
request.

Next we show that p undid the receipt of m. Since p and q received
the same decision, p rolled back. There are two cases to consider:

Case 1: m reached p after p inherited a rollback request. Obvious.

Case 2: m reached p before p inherited a rollback request. The receipt
of m was not undone only if after receiving m and before inheriting a
rollback request, p took a permanent checkpoint. However, if p took a
permanent checkpoint after receiving m while q did not take a
permanent checkpoint after sending m (since q can undo the sending of
m), lemma 4 will be contradicted.

In all cases, p undoes the receipt of m when it rolls back, contradicting
our assumption. □
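The property stated in Lemma 8 can be checked mechanically on a recorded execution. The Python sketch below is illustrative only: the `Msg` record, its two boolean fields, and the function `orphan_messages` are hypothetical, and "orphan" is a label chosen here for a message whose send was undone while its receipt was not.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Msg:
    sender: str
    receiver: str
    sent_after_ckpt: bool      # sent after the sender's permanent checkpoint
    received_after_ckpt: bool  # received after the receiver's permanent checkpoint


def orphan_messages(msgs: Iterable[Msg], rolled_back: Set[str]) -> List[Msg]:
    """Return the messages whose send is undone but whose receipt is not."""
    orphans = []
    for m in msgs:
        # An event is undone only if its process rolls back to a checkpoint
        # taken before the event occurred.
        send_undone = m.sender in rolled_back and m.sent_after_ckpt
        recv_undone = m.receiver in rolled_back and m.received_after_ckpt
        if send_undone and not recv_undone:
            orphans.append(m)
    return orphans
```

Lemma 8 says that, for the set of processes rolled back by Algorithm R, this function returns an empty list. If only the sender q of a message sent and received after both checkpoints were rolled back, the message would be reported.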

Lastly,  we  show  that  a  minimal  number  of  processes  roll  back  in  Algorithm  R. 

Let  P  be  the  set  of  processes  in  the  system  that  roll  back. 

Theorem 9: After Algorithm R terminates, for each send event that is undone, its
corresponding receive event is undone if and only if for all nonempty
Q ⊆ P such that Q does not contain the initiator, all processes in Q roll
back.

Proof: Without loss of generality, assume |P| ≥ 2. The if part is by lemma 8.

We show the only if part by contradiction. Suppose that there exists a
Q such that even if all processes in Q do not roll back, for each send
event that is undone by Algorithm R, its corresponding receive event is
undone. For any processes p and q, if p inherits a rollback request
from q, ready_to_roll_q becomes false before ready_to_roll_p becomes false.
Therefore, the inherit relation is non-circular. Because of this
non-circularity and the fact that the initiator is not in Q, there exists q ∈ Q such
that q inherits a rollback request from another process p outside of Q.
Since q ∈ P, p ∈ P. When q inherits p's request,
last_rmsg_q(p) > last_smsg_p(q). Let m be the message such that
m.l = last_rmsg_q(p). If processes in Q do not roll back while those in
P − Q do, p undoes the sending of m while q does not undo the receipt
of m, a contradiction. □

7.  Interference 

In  this  section,  we  consider  concurrent  invocations  of  the  checkpoint  and 
rollback-recovery  algorithms.  An  execution  of  these  algorithms  by  process  p  is 
interfered  with  if  any  of  the  following  events  occur: 

(1)  Process  p  receives  a  rollback  request  from  another  process  q  while  executing 
the  checkpoint  algorithm. 

(2)  Process  p  receives  a  checkpoint  request  from  q  while  executing  the  rollback- 
recovery  algorithm. 

(3) Process p, while executing the checkpoint algorithm for initiator i, receives a
checkpoint request from q, but q's request originates from a different initiator
than i.

(4) Process p, while executing the rollback-recovery algorithm for initiator i,
receives a rollback request from q, but q's request originates from a different
initiator than i.

One single rule handles the four cases of interference: once p starts the
execution of a checkpoint [rollback] algorithm, p is unwilling to take a tentative
checkpoint [roll back] for another initiator, or to roll back [take a tentative checkpoint].
As a result, in all four cases, p replies "no" to q. We can show that this rule
suffices to guarantee that all previous lemmas and theorems hold despite
concurrent invocations of the algorithms. This rule can, however, be modified to
permit more concurrency in the system. The modification is that in case (1), instead of
sending "no" to q, p can begin executing the rollback-recovery algorithm after it
finishes the checkpoint algorithm. We cannot, however, apply a similar
modification in case (2) lest deadlock occur.
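The interference rule and its relaxed variant for case (1) can be written as one small function. This is a sketch, not part of the paper's algorithms: the function `answer_request`, the strings "checkpoint", "rollback", "accept" and "defer", and the `defer_rollback` flag are hypothetical. Here "accept" means the request is handled by the normal algorithm, and "defer" means p starts the rollback-recovery algorithm once it has finished the checkpoint algorithm.

```python
from typing import Optional


def answer_request(running: Optional[str], running_initiator: Optional[str],
                   request: str, request_initiator: str,
                   defer_rollback: bool = False) -> str:
    """Decide how p treats an incoming request.

    running / request -- "checkpoint" or "rollback"; running is None when
                         p is not executing either algorithm
    defer_rollback    -- enables the relaxed handling of case (1)
    """
    if running is None:
        return "accept"
    if running == request and running_initiator == request_initiator:
        return "accept"   # same execution: no interference
    if defer_rollback and running == "checkpoint" and request == "rollback":
        return "defer"    # case (1) under the modified rule
    return "no"           # cases (1) to (4) under the single rule
```

Note that the relaxation is deliberately one-sided: a checkpoint request arriving during a rollback, case (2), is always answered "no".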

8.  Optimization 

When the initiator of the checkpoint or of the rollback-recovery algorithm fails
before propagating its decision to its cohorts, it is desirable for processes not to
block waiting for its recovery. To prevent processes from blocking, we can modify
our algorithms by replacing the underlying two-phase commit protocol with a
non-blocking three-phase commit protocol [Skee82]. However, non-blocking protocols
are inherently more expensive than blocking ones [Dwor83].

We now address the following problem: after a ckpt_cohort q of a process p
fails, p is unable to take a permanent checkpoint until q recovers (p cannot know if
the latest checkpoint of q records the sendings of all messages it received from q).
To avoid waiting for q's recovery, p can remove q from ckpt_cohort_p by restarting
from its checkpoint (using the rollback-recovery algorithm). Thereafter, process p
can take checkpoints.

9.  Message  Loss 

Rollback-recovery  can  cause  message  loss  as  illustrated  in  Figure  9.  When  p 
is  rolled  back  to  X  following  a  failure  at  F,  the  global  state  is  consistent,  but  the 
message  m  from  q  is  lost.  It  is  lost  because  the  set  of  checkpoints  {X,  Y} 
corresponds  to  a  consistent  global  state  with  m  in  the  channel. 

One method to circumvent message loss requires that processes use
transmission protocols that transform lossy channels to virtual error-free channels, e.g.,
sliding window protocols [Tane81]. Another method is to ensure that the most
recent set of checkpoints corresponds to a consistent global state with no messages
in the channels. We can modify the checkpoint and rollback-recovery algorithms to
satisfy this condition, but this modification increases the number of processes that
are forced to take checkpoints and roll back.


FIG. 9. Message loss following p's rollback to X.

10.  Conclusion 

We  have  presented  a  checkpoint  algorithm  and  a  rollback-recovery  algorithm 
to  solve  the  problem  of  bringing  a  distributed  system  to  a  consistent  state  after 
transient  failures.  In  contrast  to  previous  algorithms,  they  tolerate  failures  that 
occur  during  their  executions.  Furthermore,  when  a  process  takes  a  checkpoint,  a 
minimal  number  of  additional  processes  are  forced  to  take  checkpoints.  Similarly, 
when  a  process  restarts  after  a  failure,  a  minimal  number  of  additional  processes 
are  forced  to  restart  with  it.  We  also  show  that  the  stable  storage  requirement  of 
our  algorithms  is  minimal. 